Abstract. Conditional random field (CRF) and Structural Support Vector 
Machine (SVM) are two state-of-the-art algorithms for structured prediction, 
which captures the interdependency among output variables. The success of 
these algorithms is attributed to the fact that their discriminative models can 
account for overlapping features on the whole input observations. These fea- 
tures are usually generated by applying a given set of templates on labeled 
data, but improper templates may lead to degraded performance. To allevi- 
ate this issue, in this paper, we propose a novel multiple template learning 
paradigm to learn structured prediction and the importance of each template 
simultaneously, so that arbitrary templates could be added into the learning 
model without caution. This paradigm can be formulated as a special multiple
kernel learning problem with an exponential number of constraints. We then
introduce an efficient cutting plane algorithm to solve this problem in the
primal. We also evaluate the proposed learning paradigm on two widely-studied
structured prediction tasks, i.e. sequence labeling and dependency parsing. 
Extensive experimental results show that the proposed method outperforms 
CRFs and Structural SVMs due to exploiting the importance of each template. 

1. Introduction 

Structured prediction [13, ?, 30] has been successfully applied to problems
with strong interdependence among the output variables. In the realm of Natural
Language Processing (NLP), many tasks can be formulated as structured
prediction problems. A typical example is part-of-speech tagging, which assigns
a specific part-of-speech tag to each token of an input sentence. The tag of one
token is strongly correlated with the tags of its neighbors under the linear chain
dependency [13, 30]. More complicated structured output dependencies can be
trees or graphs, such as Context-Free Grammar (CFG) [30], dependency parsing
trees [17, 15], and factor graphs for relation extraction [37]. Note that there exist
exact inference methods for sequence and tree structures. For tasks with general
output structures (e.g., the pairwise fully connected undirected graph), however,
the exact inference problem is intractable. In such cases, approximate inference is
usually pursued to obtain an approximate solution.

The major advantage of structured prediction models such as Conditional Random
Fields (CRFs) and Structural Support Vector Machines (SVMs) is that their
learning models can easily integrate prior knowledge of a specific domain through
feature engineering. For example, the discriminative models of CRFs can account for
overlapping features (e.g., first-order or even higher-order linear chain) on the whole
observation sequence [13]. On the other hand, Structural SVMs [30, 12] rely on
joint feature maps over the input-output pairs, where features can be represented
equivalently to those of CRFs.

During the last decade, work on structured prediction has concentrated on how
to model the interdependence among the output variables, and has paid less
attention to feature engineering, which is a non-trivial task for general users.
We observe that different kinds of rules are used to extract features from input
sentences in sequence learning and dependency parsing. Arbitrary non-independent
features are usually extracted from a given input-output pair using a
set of predefined templates (rules or feature functions). Templates can be
defined arbitrarily according to the specific application by exploiting any internal
or external knowledge, and are then added to the learning model in
order to boost prediction performance. However, features generated from arbitrary
templates may be redundant or non-informative. Structured prediction models
such as CRFs and Structural SVMs treat the features generated from each
template equally, without exploiting the importance of each template or its
generated features. Therefore, improper templates may generate conflicting or
noisy features that degrade these structured prediction models.

In this paper, we propose a Multiple Template Learning (MTL) paradigm to
learn the weight of each template and the structured prediction model
simultaneously. Specifically, given a set of predefined templates, the features
extracted from each individual template naturally form a group. Learning the
weights of these groups is formulated as a Multiple Kernel Learning (MKL) problem.
This specific MKL problem usually involves an exponential number of constraints due
to the interdependence among output variables. We propose to solve this MKL
problem in the primal by an efficient cutting plane algorithm. The proposed MTL
paradigm inherits from Structural SVMs and can be easily instantiated for specific
applications. Two well-known structured prediction tasks, sequence
labeling and dependency parsing, are showcased in this paper. Extensive
experimental results demonstrate that the proposed paradigm can automatically learn
the importance of each template for the structured prediction tasks. The learned
weights can avoid the degraded performance caused by adding poorly designed
or even conflicting templates to the learning model, so users can define
templates without caution. Moreover, MTL helps boost the prediction performance
by weighting the features among different groups.

The rest of this paper is organized as follows. Section 2 briefly reviews related
work on structured prediction and multiple kernel learning. In Section 3, we first
describe the relationship between features and templates in natural language
processing tasks, and then present the proposed MTL framework for structured
prediction. Section 4 and Section 5 discuss two showcases, sequence labeling and
dependency parsing, in the MTL framework together with the related experimental
results, respectively. Finally, Section 6 gives concluding remarks.



2. Related Work 

There are a large number of real applications which have already been explored
by structured prediction. We mainly review the work closely related to this paper,
namely sequence learning and dependency parsing, where inference problems can be
solved exactly. The primary idea of sequence labeling is to learn from observation
sequences and then predict label sequences instead of individual class labels.
Fundamental tasks in NLP, such as Chinese word segmentation [23, 38], part-of-speech
tagging [13, ?, ?], and named entity recognition [15, 22, ?], have achieved great
improvements due to the development of label sequence learning models, such as
Hidden Markov models (HMMs) [24], Conditional Random Fields (CRFs) [13],
Hidden Markov Support Vector Machines (HM-SVM) [1] and Structural SVMs [30, 12].
In addition, sequence labeling has been successfully applied in other areas such as
speech recognition, computational biology and system identification. Different from
sequence labeling, dependency parsing learns a directed tree structure which
captures the syntactic relationships between pairs of words in a sentence. Graph-based
methods [?, ?, 19] and transition-based methods [36, 20, 3] are considered to be the
state of the art for dependency parsing [21]. Dependency parsing has been used in
applications ranging from question answering [33] to relation extraction [7].

The proposed MTL extends Structural SVM [30, 12] for structured and
interdependent output variables to address the feature engineering problem.
Structural SVM takes the joint feature map $\Psi(x, y)$ over the input-output pair
$(x, y)$ with the input $x \in \mathcal{X}$ and its output $y \in \mathcal{Y}$ by concatenating
different kinds of features together. The discriminant function is defined as
$f(x) = \arg\max_{y' \in \mathcal{Y}} h(x, y')$, where the compatibility function $h(x, y)$ is
assumed to be linear in $\Psi(x, y)$, i.e. $h(x, y) = \langle w, \Psi(x, y) \rangle$ as a
$w$-parameterized function. Given a convex loss function $L(x, y; w)$, Structural SVM
is formulated as a minimization of the regularized empirical risk:

(1)  $\min_{w} \ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} L(x_i, y_i; w),$

where the common choices are the structured hinge loss and the logistic loss in CRFs:

(2)  $L_{hinge}(x, y; w) = \max_{y' \in \mathcal{Y}} \ \Delta(y, y') + h(x, y'; w) - h(x, y; w)$

(3)  $L_{CRFs}(x, y; w) = \log \sum_{y' \in \mathcal{Y}} \exp\big(h(x, y'; w) - h(x, y; w)\big)$

where $\Delta(y, y')$ is a user-defined cost function. This cost penalizes the output $y'$
violating a margin constraint involving $y' \neq y$.
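To make the two losses concrete, the following minimal sketch evaluates (2) and (3) by brute-force enumeration over a tiny output space. The two-label alphabet, the score table `SCORES`, the additive compatibility function `h`, and the use of the Hamming distance for $\Delta$ are illustrative assumptions and are not taken from the paper.

```python
import math
from itertools import product

def hamming(y, y_prime):
    """Cost Delta(y, y'): number of positions where the two outputs differ."""
    return sum(a != b for a, b in zip(y, y_prime))

def hinge_loss(h, delta, y_true, outputs):
    """Structured hinge loss (2): max over y' of Delta(y, y') + h(y') - h(y)."""
    return max(delta(y_true, y) + h(y) - h(y_true) for y in outputs)

def crf_log_loss(h, y_true, outputs):
    """CRF log-loss (3): log of the sum over y' of exp(h(y') - h(y))."""
    return math.log(sum(math.exp(h(y) - h(y_true)) for y in outputs))

# Hypothetical toy problem: length-2 sequences over two labels.
SCORES = {"A": 0.5, "B": 0.0}            # assumed per-label scores
OUTPUTS = list(product("AB", repeat=2))  # all four candidate outputs

def h(y):
    """Assumed compatibility function: sum of per-label scores."""
    return sum(SCORES[label] for label in y)

Y_TRUE = ("A", "A")
print(hinge_loss(h, hamming, Y_TRUE, OUTPUTS))
print(crf_log_loss(h, Y_TRUE, OUTPUTS))
```

Both losses are non-negative whenever the true output is among the enumerated candidates. Real structured tasks replace the enumeration with dynamic programming, since the output space is exponentially large.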

Feature selection for CRFs has been explored for activity recognition by $\ell_1$
regularization [31]. Progressively reducing the features by the $\ell_1$ regularization
method may degrade the performance greatly due to the interdependence among features in
the same group. MTL considers features to be naturally grouped by
templates in terms of Multiple Kernel Learning, which was first proposed by [14];
the connection with the $\ell_{2,1}$ mixed norm was proposed by [4] to learn group sparsity for
independent output variables. A large body of work focuses on convergence and
scalability [37, ?, ?, 55]. [35] formulated MKL for the multiclass classification problem
in terms of structured prediction, but focused only on multiclass classification. It is
intractable for these MKL methods to deal with structured prediction due to the
exponential number of constraints. Recently, [15] proposed multiple kernel learning
for structured prediction trained in online mode, whereas MTL is trained in batch
mode. Moreover, we propose to use templates to generate natural feature groups
for MKL.
3. Multiple Template Learning for Structured Prediction 

Conditional Random Fields (CRFs) and Structural SVMs can be used to model
many structured prediction problems, such as sequence labeling with a first-order or
even higher-order linear chain, and syntactic parsing with a directed tree structure.
More general structures are possible if a feasible inference method exists. CRFs
are graphical models that define a conditional probability density function over the
given graph structure, while Structural SVMs incorporate the structural
information by defining a joint feature mapping under large margin theory. Even though
they model the same problem under different principles, both confront the same
difficulty: how to extract features from structured inputs for the learning
algorithms so as to obtain better performance. In many tasks, features
are extracted by applying a set of predefined templates. Our proposed Multiple
Template Learning (MTL) framework mainly focuses on how to efficiently and
effectively manage the given set of templates and learn a better structured prediction
model simultaneously.

3.1. Feature vs. Template. For problems with structured inputs, features
are usually not explicitly defined. For instance, part-of-speech tagging in
NLP labels each token in a given token sequence (sentence) with a specific tag.
In addition to the current token, its neighbors and their associated tags can be
considered important features for determining the tag of the current token. All
this intuitive information can be represented by feature functions called templates.
In this paper, we regard templates as predefined rules that are used
for the feature extraction task.

Consider a training dataset $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{n}$ with $n$ structured input-output
pairs $(X_i, Y_i)$, where $X_i \in \mathcal{X}$ and $Y_i \in \mathcal{Y}$ can be any structures, for example,
sequences, trees, or general graphs. Assume that there are $m$ templates denoted
by the functionals $\kappa_j(\cdot), \forall j = 1, \ldots, m$, over the domain of structured input-output
pairs. By applying $\kappa_j(\cdot)$ to all pairs of $\mathcal{D}$, we can instantiate a set of features
$\kappa_j(\mathcal{D}) = \{\kappa_j^1, \ldots, \kappa_j^{d_j}\}$ with $d_j$ different features. Therefore, the set of features
over $\mathcal{D}$ is $\kappa(\mathcal{D}) = \{\kappa_1(\mathcal{D}), \ldots, \kappa_m(\mathcal{D})\}$. The total number of features $|\kappa(\mathcal{D})|$ may be
smaller than $\sum_{j=1}^{m} |\kappa_j(\mathcal{D})|$ since one feature may be generated by more than one
template. Each input-output pair $(X_i, Y_i)$ can now be represented by a real-valued
feature vector $\Phi(X_i, Y_i)$ using the same set of features in $\kappa(\mathcal{D})$. Since $\kappa(\mathcal{D})$ is
extracted from $\mathcal{D}$ and only a small subset of the features in $\Phi(X_i, Y_i)$ is activated,
each instance may have a very sparse feature representation.
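The instantiation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the three templates in `TEMPLATES`, the `<pad>` boundary symbol, and the toy dataset are hypothetical, and for brevity the templates inspect only the input tokens rather than full input-output pairs.

```python
from collections import Counter

def pad(tokens, t):
    """Token at position t, or an assumed boundary symbol outside the sequence."""
    return tokens[t] if 0 <= t < len(tokens) else "<pad>"

# Hypothetical templates kappa_j: each maps (tokens, position) to a feature string.
TEMPLATES = {
    "cur": lambda toks, t: pad(toks, t),
    "prev": lambda toks, t: pad(toks, t - 1),
    "prev/cur": lambda toks, t: pad(toks, t - 1) + "/" + pad(toks, t),
}

def instantiate(dataset):
    """Apply every template to every position: {template: sorted distinct features}."""
    groups = {name: set() for name in TEMPLATES}
    for tokens in dataset:
        for t in range(len(tokens)):
            for name, fn in TEMPLATES.items():
                groups[name].add(fn(tokens, t))
    return {name: sorted(feats) for name, feats in groups.items()}

def group_counts(tokens):
    """Sparse grouped representation of one input: per template, feature -> count."""
    return {
        name: Counter(fn(tokens, t) for t in range(len(tokens)))
        for name, fn in TEMPLATES.items()
    }

DATASET = [["a", "b"], ["b", "a", "b"]]
GROUPS = instantiate(DATASET)
TOTAL = set().union(*GROUPS.values())
# The union can be smaller than the sum of group sizes when templates share features.
print(len(TOTAL), sum(len(feats) for feats in GROUPS.values()))
```

Keeping one dictionary entry per template preserves the group structure that the weighting scheme later relies on, whereas a flat concatenation would discard it.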

Previous methods for structured inputs and outputs did not consider the
properties of features, and directly used $\kappa(\mathcal{D})$ as the feature representation. Taking
part-of-speech tagging as an example, CRFs use the concatenation of all the different
features instantiated from each template, and the real value of each feature is its
frequency of occurrence. In fact, the subset of features $\kappa_j(\mathcal{D})$ generated by the $j$th
template can be naturally treated as one group. Each group of features carries
the specific meaning of the applied template, such as word, bi-gram, distance,
direction, and position relative to the current token. Moreover, templates can be
either defined arbitrarily by people without any prior knowledge, or designed
specially by domain experts. Due to the diversity of these templates, the importance
of each template should be different.
It is desirable to have a principled way to interpret these templates. In particular,
a proper template weighting can prune away poorly designed or conflicting
templates and amplify the effective ones, so that templates can be designed
without caution. To this end, in the next subsection, we propose a novel structured
prediction model with group sparsity, where features generated from the same
template naturally form a group, so as to interpret the importance of templates. The
proposed model can essentially be deemed a special Multiple Kernel Learning
(MKL) problem, where the base kernels are defined in accordance with the templates. We then
solve this MKL problem in the primal by an efficient cutting plane algorithm.

3.2. Model Formulation. Given a training dataset $\mathcal{D}$, each input-output pair
$(X_i, Y_i)$ can be represented by $\Phi(X_i, Y_i) = [\Phi_1(X_i, Y_i); \ldots; \Phi_m(X_i, Y_i)]$ using the
group representation of $\kappa(\mathcal{D})$, where the semicolon is used to concatenate column
vectors for concise representation. The goal of structure learning is to learn hypotheses

(4)  $f : \mathcal{X} \to \mathcal{Y}.$

According to Structural SVMs, a compatibility function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ over
the input-output pairs is pursued, and the prediction function can be derived by
maximizing $F$ over the output space $\mathcal{Y}$ for a given input $X \in \mathcal{X}$. The general
hypotheses $f$ are functions parameterized by the parameter vector $w$ as

(5)  $f(X; w) = \arg\max_{Y \in \mathcal{Y}} F(X, Y; w).$

The linear parametric representation of $F$ is usually used, i.e. $F(X, Y; w) = w^T \Phi(X, Y)$.
For structured outputs, the standard zero-one cost function frequently used in
classification is not appropriate. Most applications need a specific cost function, so we
define the general cost function $\Delta(Y, Y')$ to quantify the cost if the instance with
true output $Y$ is assigned the output $Y'$. Margin re-scaling with general loss
functions and a linear penalty term can be formulated as a minimization of the regularized
empirical risk:

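(6)  $\min_{w, \xi \geq 0} \ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i$

     s.t. $\forall i, \forall Y'_i \in \mathcal{Y} : \ w^T \delta\Phi_i(Y'_i) \geq \Delta(Y_i, Y'_i) - \xi_i,$

where $C$ is a trade-off parameter between training error minimization and margin
maximization, and $\delta\Phi_i(Y') = \Phi(X_i, Y_i) - \Phi(X_i, Y')$.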

As mentioned in Section 3.1, group sparsity can be applied to this problem
in terms of the naturally formed groups of features. According to Support Kernel
Machines (SKMs) [4], Problem (6) can be readily formulated as an SKM with the group
feature representation $\Phi(X_i, Y_i) = [\Phi_1(X_i, Y_i); \ldots; \Phi_m(X_i, Y_i)]$ as follows,

(7)  $\min_{w, \xi \geq 0} \ \frac{1}{2}\Big(\sum_{j=1}^{m}\|w_j\|\Big)^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i$

     s.t. $\forall i, \forall Y'_i \in \mathcal{Y} : \ \sum_{j=1}^{m} w_j^T \delta\Phi_{i,j}(Y'_i) \geq \Delta(Y_i, Y'_i) - \xi_i,$

where $w = [w_1; \ldots; w_m]$ and $\delta\Phi_{i,j}(Y') = \Phi_j(X_i, Y_i) - \Phi_j(X_i, Y')$ is the $j$th
group of $\delta\Phi_i(Y')$. The number of constraints in Problem (7) depends on the
specific structure of the space $\mathcal{Y}$. In this paper, we mainly focus on sequence or
tree structures, which usually induce an exponential number of constraints. Therefore,
the general SKMs, as well as the state-of-the-art MKL methods, cannot be applied here.
Recall that most algorithms focus on the dual problem of MKL, whereas we propose
an efficient cutting plane algorithm to solve Problem (7) in the primal form.
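The difference between the mixed-norm regularizer in (7) and the usual squared $\ell_2$ penalty in (6) can be checked numerically. The sketch below is only an illustration: the function names and the three toy weight groups are assumptions.

```python
import math

def group_norm_penalty(groups):
    """Mixed-norm regularizer of (7): 0.5 * (sum_j ||w_j||_2) ** 2."""
    return 0.5 * sum(math.sqrt(sum(v * v for v in w_j)) for w_j in groups) ** 2

def l2_penalty(groups):
    """Standard regularizer of (6): 0.5 * ||w||_2 ** 2 over the concatenated vector."""
    return 0.5 * sum(v * v for w_j in groups for v in w_j)

# Hypothetical weights for three template groups; the second group is switched off.
W = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
print(group_norm_penalty(W))  # group norms 5, 0, 13
print(l2_penalty(W))
```

Because the group norms enter through their sum before squaring, the penalty behaves like an $\ell_1$ norm across groups, which is what drives the weights of entire templates to zero.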

3.3. Multiple Kernel Learning Trained in the Primal. Problem (7) is an
MKL problem with an exponential number of constraints. The mixed norm regularizer
makes it even harder to solve. We first rewrite it in the 1-slack formulation,

(8)  $\min_{w, \xi \geq 0} \ \frac{1}{2}\Big(\sum_{j=1}^{m}\|w_j\|\Big)^2 + C\xi$

     s.t. $\forall i, \forall Y'_i \in \mathcal{Y} : \ \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} w_j^T \delta\Phi_{i,j}(Y'_i) \geq \frac{1}{n}\sum_{i=1}^{n}\Delta(Y_i, Y'_i) - \xi.$

We then propose Algorithm 1 to solve Problem (8). This algorithm constructs
a working set $\mathcal{W}$ iteratively. In each iteration, the most violated constraint is found
and added to the working set $\mathcal{W}$. Then a subproblem is solved over $\mathcal{W}$ in order to
obtain a new solution $w$. The algorithm stops if no constraint is violated by more than the
desired precision or the maximum number of iterations is reached.

Algorithm 1 Multiple Template Learning

1: Input: $\mathcal{D} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, $C$, $\epsilon$
2: $\mathcal{W} = \emptyset$, $w = 0$
3: repeat
4:    Find the most violated $(\hat{Y}_1, \ldots, \hat{Y}_n)$
5:    $\mathcal{W} := \mathcal{W} \cup \{(\hat{Y}_1, \ldots, \hat{Y}_n)\}$
6:    Obtain $w$ by solving the general QCQP subproblem over $\mathcal{W}$
7: until $\epsilon$-optimal
8: return $w$
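The control flow of Algorithm 1 can be sketched on a deliberately tiny problem. Everything below is an assumption made for illustration: the outputs are binary labels, there is a single feature group with a scalar weight, the joint feature map is taken to be y * x / 2, and a ternary search over the one-dimensional convex objective stands in for the QCQP solver of line 6. It is not the authors' implementation.

```python
def most_violated(w, data):
    """Lines 4-5: aggregate the per-example argmax into one cutting plane (p, q),
    with p = -(1/n) * sum of feature differences and q = (1/n) * sum of costs."""
    n = len(data)
    p = q = 0.0
    for x, y in data:
        d_phi = y * x                 # Phi(x, y) - Phi(x, -y) for Phi(x, y) = y * x / 2
        if 1.0 - w * d_phi > 0.0:     # the flipped label violates the margin
            p -= d_phi / n
            q += 1.0 / n
    return p, q

def slack(w, planes):
    """Smallest feasible xi for a given w: max(0, max_r q_r + w * p_r)."""
    return max([0.0] + [q + w * p for p, q in planes])

def solve_subproblem(planes, C, lo=-10.0, hi=10.0, iters=200):
    """Ternary search on 0.5 * w**2 + C * xi(w); valid only for this scalar toy."""
    def objective(w):
        return 0.5 * w * w + C * slack(w, planes)
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    w = 0.5 * (lo + hi)
    return w, slack(w, planes)

def cutting_plane(data, C=1.0, eps=1e-3, max_iter=50):
    planes, w, xi = [], 0.0, 0.0
    for _ in range(max_iter):
        p, q = most_violated(w, data)
        if q + w * p - xi <= eps:     # line 7: gap between empirical risk and xi
            break
        planes.append((p, q))
        w, xi = solve_subproblem(planes, C)   # line 6
    return w, planes

DATA = [(2.0, 1), (1.0, 1), (-1.0, -1), (-2.0, -1)]
W_STAR, PLANES = cutting_plane(DATA)
print(W_STAR, len(PLANES))
```

On this toy data the loop stops after two cutting planes. The working set stays small because each added plane summarizes the most violated labeling of the whole training set, which is the property the 1-slack formulation is designed to exploit.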



Assume that the most violated $(\hat{Y}_1, \ldots, \hat{Y}_n)$ can be found given the
model parameter $w$. After $s$ iterations, we have constructed a set $\mathcal{W}_s$ of most violated
labelings, and the subproblem in Algorithm 1 is formulated as

(9)  $\min_{w, \xi \geq 0} \ \frac{1}{2}\Big(\sum_{j=1}^{m}\|w_j\|\Big)^2 + C\xi$

     s.t. $\xi \geq q^r + \sum_{j=1}^{m} w_j^T p_j^r, \ \forall r = 1, \ldots, s,$

where $p_j^r = -\frac{1}{n}\sum_{i=1}^{n}\delta\Phi_{i,j}(\hat{Y}_i^r)$ and $q^r = \frac{1}{n}\sum_{i=1}^{n}\Delta(Y_i, \hat{Y}_i^r)$. Note that subproblem
(9) has $s$ constraints. The conic dual of (9) can be readily derived as

(10)  $\max_{\theta} \max_{\alpha \in \mathcal{A}_s} \ -\theta + \sum_{r=1}^{s}\alpha_r q^r$

      s.t. $\frac{1}{2}\alpha^T Q^j \alpha \leq \theta, \ \forall j = 1, \ldots, m,$

where $\mathcal{A}_s = \{\alpha : \sum_{r=1}^{s}\alpha_r \leq C, \ \alpha_r \geq 0, \ \forall r = 1, \ldots, s\}$ and $Q^j_{r,r'} = \langle p_j^r, p_j^{r'} \rangle$. For
completeness, we give the derivation in Appendix A. Problem (10) is a
quadratically constrained quadratic programming (QCQP) problem, which is
similar to a multiple kernel learning problem with a small number of constraints.
Furthermore, the primal solution can be recovered by $w_j = -\mu_j\sum_{r=1}^{s}\alpha_r p_j^r, \ \forall j = 1, \ldots, m$,
where $\mu_j$ is the Lagrangian multiplier of the corresponding constraint in (10).

Empirically, the number of iterations $s$ needed for Algorithm 1 to reach $\epsilon$-optimal
convergence is very small. Therefore, the QCQP problem in (10) with $s + 1$ variables
and $m + 1$ constraints can be solved efficiently by a QCQP toolbox, such as Mosek.
Since Mosek simultaneously solves the primal and its dual form, the weights
$\mu$ for each group of features can be obtained at the same time. Alternatively, one
can apply other efficient MKL algorithms to solve (10).

It is worth mentioning that the proposed method obtains $w_j = -\mu_j\sum_{r=1}^{s}\alpha_r p_j^r, \ \forall j$,
while Structural SVMs do not consider the weights $\mu$, which amounts to a uniform
weighting of all the groups of features or templates.

The procedure for finding the most violated constraints is to solve

(11)  $(\hat{Y}_1, \ldots, \hat{Y}_n) = \arg\max_{Y'_1, \ldots, Y'_n} \ \frac{1}{n}\sum_{i=1}^{n}\Delta(Y_i, Y'_i) - \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} w_j^T \delta\Phi_{i,j}(Y'_i),$

where the definitions of the cost $\Delta(Y_i, Y'_i)$ and the feature mapping $\Phi$ depend on the
specific task. The input-output pairs in $\mathcal{D}$ are generally considered i.i.d., so Problem
(11) can generally be decomposed into $n$ independent optimization problems as

(12)  $\hat{Y}_i = \arg\max_{Y'_i \in \mathcal{Y}} \ \Delta(Y_i, Y'_i) - \sum_{j=1}^{m} w_j^T \delta\Phi_{i,j}(Y'_i), \ \forall i = 1, \ldots, n.$

The stopping criterion of Algorithm 1 is defined as $R_{emp}(w_s) - R_s(w_s) \leq \epsilon$,
where the upper bound and lower bound of the risk in Problem (8) are

$R_{emp}(w) = \frac{1}{n}\sum_{i=1}^{n}\max_{Y' \in \mathcal{Y}}\Big(\Delta(Y_i, Y') - \sum_{j=1}^{m} w_j^T \delta\Phi_{i,j}(Y')\Big),$

$R_s(w) = \max_{r = 1, \ldots, s}\Big(\sum_{j=1}^{m} w_j^T p_j^r + q^r\Big).$

The maximum number of iterations can also be used to terminate the algorithm
in view of time and space limitations.

In the next sections, we tackle specific tasks, namely sequence labeling and
dependency parsing, where we introduce in detail the specific procedures for finding the
most violated constraints for the two instantiations. In addition, once the
parameter $w$ is learned, prediction is done by solving the inference Problem (5)
for a given input $X \in \mathcal{X}$.

4. Multiple Template Learning for Sequence Labeling 

In this section, we investigate the multiple template learning algorithm for sequence
labeling problems. We restrict the output structure to the linear chain structure,
where the class label of the current node depends only on its neighbors. Next,
we illustrate the problem of sequence labeling, and define the feature groups by
templates, as well as the methods for finding the most violated constraints.
Extensive experiments are performed on three sequence labeling tasks: Chinese word
segmentation, Chinese part-of-speech tagging, and language-independent named
entity recognition.
4.1. Sequence Labeling. Sequence labeling is to learn from an observation
sequence $o \in \mathcal{O}$ and then predict a label sequence $y \in \mathcal{Y}$ instead of individual
class labels. Formally, given a training dataset $\{(o_i, y_i)\}_{i=1}^{n}$ with $n$ sequence pairs
$(o_i, y_i)$, where an observation sequence $o_i = (o_{i,1}, \ldots, o_{i,l_i})$ and a class label
sequence $y_i = (y_{i,1}, \ldots, y_{i,l_i})$ have the same length $l_i$, the task of sequence labeling is
to learn the hypotheses

(13)  $g : \mathcal{O} \to \mathcal{Y},$

where the observation $o$ and the label $y$ in the pair $(o, y)$ have the same finite size
$l$.

There are a large number of applications which can be formulated as sequence
labeling problems. In this paper, we explore the advantages of the MTL proposed
in Section 3 on Chinese word segmentation, Chinese part-of-speech tagging and
language-independent named entity recognition. Taking Chinese part-of-speech
tagging as an example, given the word sequence in Figure 1(b), the goal is to predict the
class label sequence y = (ns v t n), where ns is a location name, v is a verb, t is a
time word and n is a noun. However, learning algorithms cannot be directly
applied to the string representation of a sequence. In the following sections, we
introduce the general feature representation strategy for the sequence labeling problem
and the corresponding inference methods.

4.2. Feature Representation. As stated in Section 3.1, the set of features can
be extracted by a set of predefined templates. For the sequence labeling problem, the
features are first extracted from an observation sequence, and then combined with
class labels. We define the feature symbol $\kappa_j(t, o)$, which stands for the feature
extracted by the $j$th template from the $t$th token in the observation sequence $o$. By applying
this template to all the observation sequences in the dataset $\mathcal{D}$, we can collect a set of features
$\kappa_j(\mathcal{D}) = \{\kappa_j^1, \ldots, \kappa_j^{d_j}\}$. The total set of features is $\kappa(\mathcal{D}) = \{\kappa_1(\mathcal{D}), \ldots, \kappa_m(\mathcal{D})\}$.
Now, the token in the $t$th position of the observation sequence $o$ can be represented
by a feature vector $\phi_j(t, o) = [\delta(\kappa_j(t, o), \kappa_j^1); \ldots; \delta(\kappa_j(t, o), \kappa_j^{d_j})]$, where $\delta(a, b) = 1$
if $a = b$, and 0 otherwise. A real-valued function can also be used in place of the indicator
function $\delta(\cdot, \cdot)$.

As shown in [30], by introducing the direct tensor product operator $\otimes$ and the canonical
(binary) representation of labels $y \in \Sigma$, where $\Sigma = \{y_1, \ldots, y_k\}$ is the set of unique
class labels with size $k$, each label can be represented by a zero-one
column vector:

(14)  $\Lambda(y) = [\delta(y_1, y); \delta(y_2, y); \ldots; \delta(y_k, y)] \in \{0, 1\}^k,$

with $\langle \Lambda(y), \Lambda(y') \rangle = \delta(y, y')$, and the direct tensor operator is defined as
$\otimes : \mathbb{R}^d \times \mathbb{R}^k \to \mathbb{R}^{dk}$, $(a \otimes b)_{i + (j-1)d} = a_i \cdot b_j$.
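The canonical label representation (14) and the index convention of the tensor operator can be verified with a few lines of code. The label alphabet `LABELS` and the two-dimensional observation vector are hypothetical; the indices in the code are 0-based, so the paper's position $i + (j-1)d$ becomes `i + j * d`.

```python
def one_hot(y, labels):
    """Canonical representation Lambda(y) of (14) as a 0/1 list."""
    return [1 if y == label else 0 for label in labels]

def tensor(a, b):
    """Direct tensor product: entry i + j * d (0-based) equals a[i] * b[j]."""
    d = len(a)
    out = [0] * (d * len(b))
    for j, b_j in enumerate(b):
        for i, a_i in enumerate(a):
            out[i + j * d] = a_i * b_j
    return out

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))

LABELS = ["B", "I", "O"]   # assumed label alphabet, k = 3
PHI = [1, 0]               # assumed observation vector phi_j(t, o), d = 2

print(tensor(PHI, one_hot("I", LABELS)))
```

The product places a copy of the observation vector in the block belonging to the active label and zeros elsewhere, so each label effectively gets its own weight block $w_{j,\sigma}$.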

Based on the features defined on the observation sequence and the class label sequence,
we can define the joint features, such as the state-observation feature

(15)  $\Psi_j(t, o, y) = \phi_j(t, o) \otimes \Lambda(y_t), \ \forall j = 1, \ldots, m,$

and the state-state feature

(16)  $\Psi(t, t-1, y) = \Lambda(y_{t-1}) \otimes \Lambda(y_t).$
The features over observation-label sequence pairs $(o, y)$ can be defined as the
summation of feature vectors over all tokens in an observation sequence as

(17)  $\Phi_j(o, y) = \sum_{t=1}^{l}\phi_j(t, o) \otimes \Lambda(y_t), \ \forall j = 1, \ldots, m,$

and

(18)  $\Psi(y) = \sum_{t=2}^{l}\Lambda(y_{t-1}) \otimes \Lambda(y_t).$

Finally, we obtain a feature vector $\Phi(o, y) = [\Phi_1(o, y); \ldots; \Phi_m(o, y); \Psi(y)]$
to represent the observation-label sequence pair $(o, y)$. In this paper, we use
the indicator function to construct features, so the value of each entry of
$\Phi(o, y)$ is the number of times the feature appears in the given observation-label
pair $(o, y)$. For simplicity, we only implement the state-state transition
feature without observations. In general, high-order dependence among labels and
state-state transition with observation in the linear chain also can be considered.
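A sparse version of the joint feature vector built from (17) and (18) can be sketched as a counter keyed by group, feature and label. The single template `U02`, the group name `B` for transitions, and the toy sequence are assumptions for illustration only.

```python
from collections import Counter

def joint_features(tokens, labels, templates):
    """Sparse Phi(o, y): counts of state-observation and state-state features."""
    phi = Counter()
    for t, y in enumerate(labels):
        for name, fn in templates.items():
            phi[(name, fn(tokens, t), y)] += 1     # state-observation term, as in (17)
        if t > 0:
            phi[("B", labels[t - 1], y)] += 1      # state-state term, as in (18)
    return phi

# Hypothetical template set with a single uni-gram template on the current token.
TEMPLATES_SEQ = {"U02": lambda toks, t: toks[t]}

PHI_SEQ = joint_features(["a", "b", "a"], ["X", "Y", "X"], TEMPLATES_SEQ)
print(sorted(PHI_SEQ.items()))
```

Each key corresponds to one non-zero coordinate of the dense tensor-product vector, and the first component of the key identifies the template group whose weight is learned.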

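4.3. Inference. Given a training dataset $\mathcal{D} = \{(o_i, y_i)\}_{i=1}^{n}$, we can apply $m$
templates to obtain a new form of data with feature representation $\Phi(o_i, y_i) =
[\Phi_1(o_i, y_i); \ldots; \Phi_m(o_i, y_i); \Psi(y_i)], \forall i = 1, \ldots, n$. There is an implicit group $\Psi$ for the
label sequence to capture the label dependence between the current label and its
neighbors. Before directly using Algorithm 1 to solve this task, we need to solve
Problem (12) as

(19)  $\hat{y}_i = \arg\max_{y'_i \in \mathcal{Y}} \ \Delta(y_i, y'_i) + w^T \Phi(o_i, y'_i)$

      $\ \ \ = \arg\max_{y'_i \in \mathcal{Y}} \ \Delta(y_i, y'_i) + \sum_{j=1}^{m}\langle w_j, \Phi_j(o_i, y'_i) \rangle + \langle \bar{w}, \Psi(y'_i) \rangle,$

where $w = [w_1; \ldots; w_m; \bar{w}]$. We notice that, given an observation sequence $o$, the
prediction function in Problem (5) is exactly Problem (19) without the loss term
$\Delta(y_i, y'_i)$. According to Corollary 23 in [30], we can derive the following equations,

(20)  $\langle w_j, \Phi_j(o_i, y'_i) \rangle = \sum_{t=1}^{l_i}\langle w_j, \phi_j(t, o_i) \otimes \Lambda(y'_{i,t}) \rangle
      = \sum_{t=1}^{l_i}\sum_{\sigma \in \Sigma}\langle w_{j,\sigma}, \phi_j(t, o_i) \rangle \delta(\sigma, y'_{i,t}),$

(21)  $\langle \bar{w}, \Psi(y'_i) \rangle = \sum_{t=2}^{l_i}\langle \bar{w}, \Lambda(y'_{i,t-1}) \otimes \Lambda(y'_{i,t}) \rangle
      = \sum_{t=2}^{l_i}\sum_{\sigma \in \Sigma}\sum_{\sigma' \in \Sigma}\bar{w}_{\sigma,\sigma'}\,\delta(\sigma, y'_{i,t-1})\,\delta(\sigma', y'_{i,t}).$

By substituting (20) and (21) into Problem (19), we obtain

(22)  $\langle w, \Phi(o_i, y'_i) \rangle = \sum_{\sigma \in \Sigma} v_\sigma + \sum_{\sigma \in \Sigma}\sum_{\sigma' \in \Sigma}\bar{v}_{\sigma,\sigma'},$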




Q. MAO AND IVOR W. TSANG 



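where $v_\sigma = \sum_{t=1}^{l_i}\big(\sum_{j=1}^{m}\langle w_{j,\sigma}, \phi_j(t, o_i) \rangle\big)\delta(\sigma, y'_{i,t})$ and $\bar{v}_{\sigma,\sigma'} = \sum_{t=2}^{l_i}\bar{w}_{\sigma,\sigma'}\,\delta(\sigma, y'_{i,t-1})\,\delta(\sigma', y'_{i,t})$.
Problem (5) with the sequence feature representations (20) and (21) can be readily
solved by the Viterbi algorithm [32]. In order to solve Problem (19), we need
to design the loss function $\Delta(y_i, y'_i)$, which is usually defined as the Hamming loss, i.e.
$\Delta(y_i, y'_i) = \sum_{t=1}^{l_i}\big(1 - \delta(y_{i,t}, y'_{i,t})\big) = l_i - \sum_{\sigma \in \Sigma} u_\sigma$, where $u_\sigma = \sum_{t=1}^{l_i}\delta(y_{i,t}, \sigma)\,\delta(y'_{i,t}, \sigma)$.
Therefore, $\Delta(y_i, y'_i)$ can be merged into the first term on the right-hand side of (22),
so the Viterbi algorithm can be used to find the most violated constraints.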
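Because the Hamming loss decomposes over positions, loss-augmented inference only requires adding 1 to the node score of every label that disagrees with the gold label. The following sketch shows this on a first-order chain. The score tables `EMIT` and `TRANS` are hypothetical stand-ins for the quantities $\sum_j \langle w_{j,\sigma}, \phi_j(t, o) \rangle$ and $\bar{w}_{\sigma,\sigma'}$, and the function is an illustration rather than the authors' code.

```python
def viterbi(emit, trans, labels, gold=None):
    """First-order Viterbi decoding.

    emit[t][s] is the node score of label s at position t and trans[(r, s)] the
    transition score from r to s (missing pairs score 0). If gold is given, the
    Hamming loss is folded into the node scores (loss-augmented inference).
    """
    def node(t, s):
        bonus = 1.0 if gold is not None and s != gold[t] else 0.0
        return emit[t][s] + bonus

    best = {s: node(0, s) for s in labels}
    back = []
    for t in range(1, len(emit)):
        new_best, pointers = {}, {}
        for s in labels:
            prev = max(labels, key=lambda r: best[r] + trans.get((r, s), 0.0))
            new_best[s] = best[prev] + trans.get((prev, s), 0.0) + node(t, s)
            pointers[s] = prev
        best = new_best
        back.append(pointers)

    last = max(labels, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path)), best[last]

# Hypothetical scores for a length-2 sequence over labels A and B.
EMIT = [{"A": 1.0, "B": 0.3}, {"A": 0.0, "B": 0.5}]
TRANS = {("A", "B"): 0.2}

print(viterbi(EMIT, TRANS, ["A", "B"]))                    # plain prediction
print(viterbi(EMIT, TRANS, ["A", "B"], gold=["A", "B"]))   # most violated labeling
```

With `gold=None` the routine solves the prediction problem, and with a gold sequence it returns the labeling that maximizes the loss plus the score, which is the most violated constraint for that example.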

4.4. Experiments. In this section, we evaluate our proposed method, called Multiple
Template Learning for sequence labeling (MTL^hmm), on several natural language
processing tasks which can be cast as sequence labeling problems,
and compare it with state-of-the-art sequence labeling algorithms: CRF
with a Gaussian prior over parameters (CRF($\ell_2$)), CRF with a Laplacian prior
(CRF($\ell_1$)) and Structural SVMs for sequence learning (SVM^hmm). The
implementations used for comparison are the CRF++ software (see footnote 1) and the
SVM^hmm software (see footnote 2). We implement MTL^hmm in C++. For fair comparisons,
we use the same feature templates, built from the information provided by the training
dataset without using any external prior knowledge. The set of templates is also defined
arbitrarily without considering any language prior knowledge. All templates used in this
paper only capture local context information by uni-grams, bi-grams or tri-grams. Although
the templates listed in the following sections may not be the state of the art, the
goal of this paper is to examine whether or not the weighted group features can
achieve better performance than the uniform weighting strategy adopted in CRF
and SVM^hmm. Users can add more templates according to their own prior knowledge
about the specific task if needed.

The template format used in the sequence learning problem is borrowed from CRF++.
Before applying templates to the datasets (both training and test),
we preprocess them into sequences of tokens. Each token consists of a
fixed number of columns. The definition of a token depends on the task.
For Chinese word segmentation, one token is one Chinese character in a sentence,
while one word is a token for Chinese part-of-speech tagging and named entity
recognition. Each column of a token represents some kind of semantic meaning,
such as the word and the part-of-speech tag in the named entity recognition task. In each
template, the special macro %x[row,col] is used to specify a token in the input
data: row specifies the position relative to the current focus token and col
specifies the absolute position of the column. The template index is used to
maintain the relative position of features. For the bi-gram features, our meaning
differs from the CRF++ definition. In this paper, bi-gram features combine token
information at two relative positions, while CRF++ uses the term for the state
transition feature. We also use B to denote the state-state transition features in the form
of (16).
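The way a %x[row,col] macro selects a column of a neighboring token can be sketched as below. This is a minimal illustration, not the CRF++ implementation: the function names, the `_PAD_` symbol for positions outside the sequence, and the `index:value` feature-string layout are assumptions.

```python
import re

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand(template, tokens, t):
    """Replace every %x[row,col] with column col of the token at position t + row."""
    def substitute(match):
        row, col = int(match.group(1)), int(match.group(2))
        pos = t + row
        if 0 <= pos < len(tokens):
            return tokens[pos][col]
        return "_PAD_"   # assumed placeholder outside the sequence
    return MACRO.sub(substitute, template)

def feature_string(index, template, tokens, t):
    """Prefix the expansion with the template index so groups stay distinguishable."""
    return index + ":" + expand(template, tokens, t)

# Each token is a list of columns, e.g. [word, tag]; the values are hypothetical.
TOKENS = [["a", "X"], ["b", "Y"], ["c", "Z"]]

print(feature_string("U08", "%x[-1,0]/%x[0,0]", TOKENS, 1))
```

Prefixing the template index keeps features produced by different templates apart, which is what allows them to be treated as separate groups during learning.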

In order to compare different algorithms fairly, we use the features generated by
CRF++ from a given template set as the input for SVM^hmm and MTL^hmm. For
CRF++, we use the default setting for all the parameters except the hyper-parameter
of the CRF++ toolbox set by the option "-c", which is tuned over
$\{10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$. The option "-a" is used to determine whether CRF($\ell_2$) or CRF($\ell_1$)


1 http://crfpp.sourceforge.net/

2 http://www.cs.cornell.edu/People/tj/svm_light/svm_hmm.html
Type       Index   Template
uni-gram   U00     %x[-2,0]
           U01     %x[-1,0]
           U02     %x[0,0]
           U03     %x[1,0]
           U04     %x[2,0]
bi-gram    U08     %x[-1,0]/%x[0,0]
           U09     %x[0,0]/%x[1,0]
           U10     %x[-2,0]/%x[-1,0]
           U11     %x[1,0]/%x[2,0]
           U12     %x[-2,0]/%x[0,0]
           U13     %x[-1,0]/%x[1,0]
           U14     %x[0,0]/%x[2,0]
           U15     %x[-2,0]/%x[1,0]
           U16     %x[-1,0]/%x[2,0]
tri-gram   U05     %x[-2,0]/%x[-1,0]/%x[0,0]
           U06     %x[-1,0]/%x[0,0]/%x[1,0]
           U07     %x[0,0]/%x[1,0]/%x[2,0]
           B       state transition

Table 1. The set of templates used for the Chinese word segmentation task.



Training Dataset   # Sentence   # Token
AS                 708,953      8,368,050
CityU              53,019       2,403,355
PKU                19,056       1,826,458
MSR                86,924       4,050,469

Table 2. Training datasets for the Chinese word segmentation task.



is used. The default setting of CRF++ does not remove any features according to
their frequency of occurrence in the training data. Accordingly, in addition to the default
setting of SVM^hmm, we tune C in the range $n \times \{10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$ to obtain
the equivalent setting and use the recommended termination condition parameter
setting "-e 0.5" from the SVM^hmm website. The proposed algorithm MTL^hmm
has a similar setting to SVM^hmm, but we also set the maximum number of iterations to
500. The best performance for each algorithm is reported by tuning its own
parameters.

In the next subsections, we present detailed experiments on the natural
language processing tasks of Chinese word segmentation, Chinese part-of-speech
tagging and named entity recognition. The same experimental settings are
applied to all tasks.

4.4.1. Chinese Word Segmentation. Chinese is written without inter-word spaces,
such as the blank character in English, so it is very hard to find the word
boundaries. However, this process is an essential first step in many natural language
processing applications such as mono- and cross-lingual information retrieval and
text-to-speech systems [3]. Figure 1 gives an example of the Chinese word segmentation
task, where Figure 1(a) shows the input Chinese character sequence, and Figure





Q. MAO AND IVOR W. TSANG 



(a) the Chinese character sequence

(b) the segmented word sequence

Figure 1. An example for Chinese word segmentation.



1(b) shows the segmented word sequence with blank characters inserted into the
proper positions of the original character sequence.

The Chinese word segmentation problem has been an active area of research in
computational linguistics for several decades, and many algorithms have been
proposed to address it. In this paper, we follow the character tagging principle
for Chinese word segmentation [315], in particular sequence tagging methods such
as CRFs [23], which are considered state of the art. Under this principle, each
character is assigned an encoded class label, and the label sequence can be used
to recover the segmented word sequence. The label encoding is explained later in
this section.

The second International Chinese Word Segmentation Bakeoff provides a platform
for evaluating different methods. It contains four corpora: Academia Sinica
(AS), City University of Hong Kong (CityU), Peking University (PKU) and
Microsoft Research (MSR), which are available from the website (see footnote
3). Each corpus comprises a training dataset and a ground-truth testing dataset.
For the detailed corpus information in Table 2, we refer to [5]. All the
algorithms are trained on the training datasets; results are then reported on
the testing datasets by removing the labels from the ground-truth data and
evaluating the predictions with the following measures: test recall (Recall),
test precision (Prec), balanced F score ($F_1 = 2 \times P \times R/(P + R)$),
and recall on in-vocabulary words ($R_{iv}$). All scores are computed by the
scoring script distributed with the corpus.
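
As an illustration of the measures above, the sketch below computes word-level precision, recall and F1 by comparing the character spans of gold and predicted words. It is a simplified stand-in, not the bakeoff scoring script; the function names are hypothetical and in-vocabulary recall is omitted.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_scores(gold_words, predicted_words):
    """Return (precision, recall, F1); a word is correct if its span matches."""
    gold, pred = word_spans(gold_words), word_spans(predicted_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example over the character string "abcde": only "ab" is segmented correctly.
gold = ["ab", "c", "de"]
pred = ["ab", "cd", "e"]
p, r, f = segmentation_scores(gold, pred)
```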

As discussed in the previous sections, all the methods mentioned above need
feature templates to extract features from the labeled corpus. The set of
templates used for the Chinese word segmentation task is shown in Table 1. We
use three labels, B, I and E (B is the beginning character of a word; I is an
inner character of a word; E is the end character of a word), to transform the
Chinese word segmentation task into a label sequence learning problem.
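
The label transformation can be sketched as follows. The text does not say how a single-character word is labeled under the three-label scheme; this sketch assumes it receives B, which keeps decoding unambiguous because every B opens a new word. The function names are hypothetical.

```python
def words_to_labels(words):
    """Encode a segmented sentence as one B/I/E label per character."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("B")  # assumption: a single-character word is labeled B
        else:
            labels.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return labels


def labels_to_words(characters, labels):
    """Recover the word sequence from characters and their B/I/E labels."""
    words = []
    for char, label in zip(characters, labels):
        if label == "B" or not words:
            words.append(char)   # B opens a new word
        else:
            words[-1] += char    # I and E extend the current word
    return words


words = ["ab", "c", "def"]
labels = words_to_labels(words)
restored = labels_to_words("abcdef", labels)
```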

To examine whether improper or redundant templates affect sequence prediction
performance, we divide the set of templates into two categories: TP1 and TP2.
TP1 includes the templates indexed from U00 to U09 plus B, while TP2 includes
the ones indexed from U00 to U16 plus B. In TP2, we deliberately enumerate all
the possible bi-gram templates in the window [-2, +2]. This construction of
templates is very likely to include redundant information. We run all the
algorithms on the features extracted from the training dataset by applying TP1
and TP2, respectively.

The testing results on the above two sets of templates are reported in Table 3.
We observe the following facts. In TP1, the four algorithms have similar F1
scores, differing by at most 0.5%, except that CRF(ℓ2) is 1% higher than SVM^hmm
on AS. However, MTL^hmm always demonstrates the highest $R_{iv}$, which means
that the weighting strategy is more useful for predicting the known words
correctly. Even

3 http://www.sighan.org/bakeoff2005/
Dataset   Algorithm   Recall        Prec          F1            R_iv
AS        CRF(ℓ2)     95.6 / 95.1   94.4 / 93.8   95.0 / 94.4   97.0 / 96.4
          CRF(ℓ1)     95.4 / 95.0   94.1 / 93.4   94.8 / 94.2   96.7 / 96.4
          SVM^hmm     94.6 / 94.3   93.2 / 93.3   93.9 / 93.8   96.0 / 95.6
          MTL^hmm     95.6 / 95.8   93.9 / 93.9   94.7 / 94.8   97.2 / 97.3
CityU     CRF(ℓ2)     94.1 / 92.8   94.3 / 92.6   94.2 / 92.7   96.3 / 95.1
          CRF(ℓ1)     93.4 / 92.9   93.6 / 92.9   93.5 / 92.9   95.6 / 95.1
          SVM^hmm     94.3 / 94.1   94.1 / 93.5   94.2 / 93.8   96.4 / 96.4
          MTL^hmm     95.2 / 95.2   94.2 / 94.3   94.7 / 94.8   97.6 / 97.6
PKU       CRF(ℓ2)     92.7 / 92.2   94.2 / 93.3   93.4 / 92.7   94.7 / 94.2
          CRF(ℓ1)     91.6 / 91.8   93.1 / 92.4   92.4 / 91.7   93.8 / 93.4
          SVM^hmm     92.6 / 93.0   93.7 / 93.4   93.2 / 93.2   94.9 / 95.0
          MTL^hmm     93.8 / 94.0   93.3 / 93.6   93.6 / 93.8   96.1 / 96.2
MSR       CRF(ℓ2)     96.5 / 95.7   96.7 / 96.0   96.6 / 95.8   97.3 / 96.5
          CRF(ℓ1)     96.2 / 95.7   96.2 / 95.6   96.2 / 95.6   96.9 / 96.5
          SVM^hmm     96.7 / 96.1   96.5 / 96.5   96.6 / 96.3   97.6 / 96.7
          MTL^hmm     96.9 / 97.0   96.5 / 96.4   96.7 / 96.7   97.9 / 98.0

Table 3. The Chinese word segmentation results of different algorithms on all
the testing datasets. The left-hand side of / is the result using TP1, while
the right-hand side is the result using TP2.



[Bar chart: x-axis, index of template (U00 to U16, B); bars for TP1 and TP2;
solid line, average weight.]

Figure 2. The learned template weights by MTL^hmm for Chinese word segmentation
in TP1 and TP2, respectively, and the solid line for the average weight.



though MTL^hmm and SVM^hmm use the same set of features, MTL^hmm is better
than SVM^hmm according to the F1 score. This shows that the proposed template
weighting strategy helps to boost the performance.

In TP2, similar phenomena can be observed, but the differences in F1 score
among the methods are enlarged. For example, on the CityU dataset, MTL^hmm is
2.1% higher than CRFs. MTL^hmm consistently shows the best F1 score among the
four

[Line plots of F1 against C, one panel per dataset: (a) AS, (b) CityU, (c) PKU,
(d) MSR; curves for CRF(ℓ2), CRF(ℓ1), SVM^hmm and MTL^hmm.]

Figure 3. F1 measure in terms of varied C in TP2.

algorithms. By comparing the results between TP1 and TP2, we find that adding
more templates, such as templates U10-U16, degrades the performance of CRFs and
SVM^hmm, whereas MTL^hmm obtains slight improvements. Therefore, the performance
of MTL^hmm is not greatly affected by arbitrarily added templates. This may be
due to the fact that MTL^hmm can effectively identify the important templates
from a given set of arbitrary templates, and can remove unimportant templates
simultaneously. In contrast, CRF(ℓ2) and SVM^hmm do not consider this
information, and CRF(ℓ1) works poorly on CityU and PKU. This statement can be
justified by the learned template weights. Figure 2 shows one example of the
learned template weights on AS with the settings of TP1 and TP2. We can observe
that most of the weight is assigned to templates U05 to U09 in TP1, but U07
turns out to be unimportant when U10-U16 are added to TP1. Moreover, U16 is also
considered a redundant template in TP2. We conclude that MTL^hmm can effectively
and efficiently handle arbitrary templates without performance degradation,
whereas CRFs and SVM^hmm cannot.

In addition, Figure 3 shows the variation of the testing F1 measure over the
range of C used in the experimental setting. MTL^hmm obtains better results than
the other algorithms for large C, i.e. C > 1 in Figure 3. Therefore, weighting
groups of features can boost the F1 measure on all four Chinese word
segmentation datasets. CRF(ℓ1) varies greatly with C, which implies that
aggressively removing features can degrade prediction performance.

Figure 4. An example for Chinese part-of-speech tagging.

4.4.2. Chinese Part-of-Speech Tagging. Part-of-speech (POS) tagging is the task
of assigning a lexical category, or part-of-speech tag, to each word in a
sentence. POS tagging is an important component in natural language processing,
e.g. for shallow parsing or full parsing. Figure 4 shows an example where ns is
a location name, v is a verb, t is a time word and n is a noun.

Similar to Chinese word segmentation, part-of-speech tagging can also be
transformed into a sequence labeling task based on words, as in English
[13l E H2]. Chinese part-of-speech tagging is more difficult because its
features are implicit, unlike the prefixes and suffixes in English, which more
explicitly determine the part of speech of each word in a sentence. The only
information that we can use is the character or word. Sequence learning methods,
which exploit multiple interacting features or long-range dependencies of the
observations, can improve on the hidden Markov model and the Maximum Entropy
model [26].



Type        Index   Template
uni-gram    U00     %x[-2,0]
            U01     %x[-1,0]
            U02     %x[0,0]
            U03     %x[1,0]
            U04     %x[2,0]
bi-gram     U05     %x[-2,0]/%x[-1,0]
            U06     %x[1,0]/%x[2,0]
tri-gram    U07     %x[-1,0]/%x[0,0]/%x[1,0]
            B       state transition

Table 4. The set of templates used for Chinese Part-Of-Speech task.



In order to compare our proposed algorithm with CRFs and SVM^hmm on the Chinese
part-of-speech task, we evaluate on the first month of the 1998 Peking People's
Daily newspaper corpus, which is available from www.icl.pku.edu.cn. This
dataset consists of 19,484 sentences. For simplicity, we divide the data into a
training dataset and a test dataset with an equal number of sentences,
following the original sentence order. The training data contain 9,746
sentences with 567,886 tokens in total. There are 44 part-of-speech tags.
However, the tag Yg appears only in the test dataset, so only 43 tags are
actually used to train a part-of-speech tagging model. Since all the methods
mentioned in this paper can deal with complex features of the observations, we
use the feature templates in Table 4. Here, each token is a Chinese word. We
consider this set of templates as one instance for comparing different
algorithms.



Method     CRF(ℓ2)   CRF(ℓ1)   SVM^hmm   MTL^hmm
Accuracy   92.45     92.59     93.14     93.59

Table 5. The accuracy for Chinese Part-of-Speech task.

[Bar charts in two panels: x-axis, part-of-speech tag; y-axis, test accuracy.]

Figure 5. The test accuracy of different algorithms on individual
part-of-speech tags.



[Bar chart: x-axis, index of template (U00 to U07, B); bars for MTL^hmm; solid
line, average weight.]

Figure 6. The learned template weights by MTL^hmm for Chinese part-of-speech
tagging and the solid line for the average weight.

All algorithms achieve their best performance with parameters a = 100 and
C = 100n, except a = 1 for CRF(ℓ1). Table 5 shows the accuracy of each
algorithm. From Table 5, we can observe that our proposed method achieves higher
accuracy than CRF(ℓ2), CRF(ℓ1) and SVM^hmm. To analyze the results in detail, we
plot the accuracy of all algorithms for each part-of-speech tag in Figure 5.
Apart from Yg, which is not recognized because it does not appear in the
training dataset, none of the algorithms can identify Mg and Bg. Leaving aside
these difficult tags, which need more sophisticated templates designed by
domain experts, CRF(ℓ2) also cannot identify tags such as e, Rg and o, whereas
SVM^hmm and MTL^hmm can. Roughly, CRF(ℓ2) performs worse than SVM^hmm and
MTL^hmm on almost all the tags in Figure 5. Therefore, we can argue that
sequence learning based on the large margin principle has a clear advantage
over CRF(ℓ2) in this task.

Even though both SVM^hmm and MTL^hmm are based on the large margin principle
and use the same features, their performances differ slightly. MTL^hmm performs
better than or comparably to SVM^hmm. In particular, for the tags e, Rg, Vg, I,
i, Dg and z, MTL^hmm obtains a large improvement over SVM^hmm. This observation
implies that learning the weight of each template, as MTL^hmm does, can improve
the performance of part-of-speech tagging, and supports our initial assumption.
We also report the learned template weights in Figure 6, where the average
weight is used to represent SVM^hmm and CRFs since they do not learn template
weights. From Figure 6, we can clearly see that the templates U02 and U07
dominate the weight vector, i.e. they are very important templates for the
part-of-speech tagging task. This is intuitive: the features generated by U02
indicate that the current word is informative about its own part-of-speech tag,
and those generated by U07 indicate that the nearest neighbors are the most
important contextual indicators. From the results shown in Table 5 and Figure
6, we can argue that the weighted template scheme is helpful for boosting the
performance of part-of-speech tagging.



Type        Index   Template
uni-gram    U00     %x[-2,0]
            U01     %x[-1,0]
            U02     %x[0,0]
            U03     %x[1,0]
            U04     %x[2,0]
            U10     %x[-2,1]
            U11     %x[-1,1]
            U12     %x[0,1]
            U13     %x[1,1]
            U14     %x[2,1]
bi-gram     U05     %x[-1,0]/%x[0,0]
            U06     %x[0,0]/%x[1,0]
            U15     %x[-2,1]/%x[-1,1]
            U16     %x[-1,1]/%x[0,1]
            U17     %x[0,1]/%x[1,1]
            U18     %x[1,1]/%x[2,1]
tri-gram    U20     %x[-2,1]/%x[-1,1]/%x[0,1]
            U21     %x[-1,1]/%x[0,1]/%x[1,1]
            U22     %x[0,1]/%x[1,1]/%x[2,1]
            B       state transition

Table 6. The set of templates used for Named Entity Recognition task.

4.4.3. Language-Independent Named Entity Recognition. Named entities are phrases 
that contain the names of persons, organizations, locations, times and quantities. 

Training Dataset   #Sentence   #Token
Spanish (ESP)      8,323       264,715
Dutch (NED)        15,806      202,931

Table 7. Training dataset for Named Entity Recognition task.



For example, the sentence "U.N. official Ekeus heads for Baghdad." can be tagged
as "[ORG U.N.] official [PER Ekeus] heads for [LOC Baghdad]." where ORG is an
organization name, PER is a person name and LOC is a location name. In this
task, the goal is to identify all the entities of the specified types in a given
input sentence.

Sequence tagging models, such as CRFs [16], have been used for the named entity
recognition task and demonstrate promising results. We evaluate our method and
other methods on the CoNLL-2002 language-independent named entity recognition
shared task, which is available from the website (see footnote 4). The data of
the CoNLL-2002 shared task consist of three files per language: one training
file and two test files, testa and testb. The first test file is used in the
development phase for finding good parameters for the learning system. The
second test file is used for the final evaluation. These data files are
available for two languages: Spanish (ESP) and Dutch (NED). Four types of
phrases are defined: person names (PER), organizations (ORG), locations (LOC)
and miscellaneous names (MISC). The label notation uses the B, I, O format for
each entity, where B is the start word of an entity, I is an inner word of an
entity, and O is a word not belonging to any entity. Table 7 describes the
training data in detail. Moreover, each token is assigned a part-of-speech
(POS) tag. The set of templates used for the named entity recognition task is
defined in Table 6, where some templates include POS tag information, such as
U10, which uses the POS tag two tokens before the current one.
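
The B, I, O notation can be decoded into typed entity spans as in the sketch below. It assumes labels written as "B-TYPE", "I-TYPE" and "O", and it silently ignores an I label that does not continue an open entity of the same type; both choices are assumptions of this example rather than details given in the text.

```python
def bio_to_entities(labels):
    """Return (type, start, end) spans, with `end` exclusive, from B/I/O labels."""
    entities, start, kind = [], None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        continues = label.startswith("I-") and kind == label[2:]
        if start is not None and not continues:
            entities.append((kind, start, i))   # close the open entity
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]          # open a new entity
    return entities


tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
labels = ["B-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]
entities = bio_to_entities(labels)
```

Precision, recall and F1 for this task are then computed over such spans, so a predicted entity counts as correct only if both its type and its boundaries match.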

Table 8 shows the experimental results evaluated on the individual validation
datasets and test datasets, respectively. From Table 8, we can observe that
MTL^hmm significantly outperforms SVM^hmm and CRFs in terms of recall and
F-value on both the Spanish and Dutch datasets. Notably, the precision of
MTL^hmm on the Dutch data is also better than that of CRFs by 4%. The per-type
performance can be seen in Table 8. On the Spanish dataset, MTL^hmm achieves a
higher F1 score than the others on LOC and MISC, and is comparable with the
best F1 score of the others on ORG and PER. On the Dutch dataset, the
improvement is more obvious: except for PER, MTL^hmm performs much better than
the others in terms of F1. This shows the effectiveness of MTL^hmm on the task
of named entity recognition. We also show the weights learned on the Spanish
and Dutch datasets, respectively (Figure 7). Even though the languages are
different, the dominant templates are similar, namely U02, U05 and U06. This
means that the current word features (U02) and the bi-gram features around the
current word (U05 and U06) are the most important features for determining
entities within the given set of templates.



4 http://www.cnts.ua.ac.be/conll2002/ner/

Spanish (ESP)
                      Validation set           Test set
Type      Method      Prec    Recall  F1       Prec    Recall  F1
LOC       CRF(ℓ2)     ?       ?       ?        ?       ?       ?
          CRF(ℓ1)     ?       ?       ?        82.86   ?       ?
          SVM^hmm     67.05   76.24   71.35    78.36   67.80   72.70
          MTL^hmm     68.59   77.36   72.71    80.70   69.83   74.88
MISC      CRF(ℓ2)     ?       ?       ?        ?       ?       ?
          CRF(ℓ1)     58.46   34.16   43.12    57.48   36.18   44.40
          SVM^hmm     54.52   ?       ?        56.56   ?       ?
          MTL^hmm     ?       ?       ?        ?       ?       ?
ORG       CRF(ℓ2)     ?       ?       ?        ?       ?       ?
          CRF(ℓ1)     79.64   59.35   68.01    82.20   69.93   75.57
          SVM^hmm     78.39   60.18   68.09    78.17   69.57   73.62
          MTL^hmm     77.85   62.24   69.17    78.25   73.00   75.54
PER       CRF(ℓ2)     83.15   60.97   70.35    84.02   77.96   80.88
          CRF(ℓ1)     83.39   62.03   71.14    82.87   77.69   80.20
          SVM^hmm     84.50   64.24   72.99    85.53   80.41   82.89
          MTL^hmm     88.01   59.49   71.00    88.46   77.14   82.41
Overall   CRF(ℓ2)     78.00   61.03   68.48    81.76   67.52   73.96
          CRF(ℓ1)     76.58   60.34   67.50    80.75   66.70   73.00
          SVM^hmm     74.39   62.82   68.12    78.17   68.50   73.02
          MTL^hmm     76.09   62.32   68.52    79.96   69.74   74.50

Dutch (NED)
                      Validation set           Test set
Type      Method      Prec    Recall  F1       Prec    Recall  F1
LOC       CRF(ℓ2)     82.52   49.27   61.70    82.50   55.43   66.31
          CRF(ℓ1)     82.71   45.93   59.06    80.91   55.30   65.69
          SVM^hmm     70.94   51.98   60.00    76.21   58.79   66.37
          MTL^hmm     88.76   49.48   63.54    92.88   62.40   74.65
MISC      CRF(ℓ2)     90.46   36.76   52.28    89.57   38.33   53.69
          CRF(ℓ1)     86.55   41.31   55.93    86.03   42.54   56.93
          SVM^hmm     76.86   38.64   51.42    79.15   40.94   53.97
          MTL^hmm     86.71   47.99   42.42    87.58   48.69   62.59
ORG       CRF(ℓ2)     88.51   33.67   48.79    87.60   38.44   53.43
          CRF(ℓ1)     82.89   36.01   50.20    83.18   39.80   53.83
          SVM^hmm     83.77   32.36   46.69    81.49   35.94   49.88
          MTL^hmm     83.62   42.42   56.29    85.01   46.94   60.48
PER       CRF(ℓ2)     64.31   55.62   59.65    73.19   70.13   71.63
          CRF(ℓ1)     66.49   53.06   59.02    74.70   67.21   70.76
          SVM^hmm     60.28   55.48   57.78    69.07   68.12   68.59
          MTL^hmm     69.77   52.20   59.72    78.71   64.30   70.78
Overall   CRF(ℓ2)     77.66   43.31   55.61    80.79   50.57   62.20
          CRF(ℓ1)     77.53   43.92   56.08    80.05   51.31   62.53
          SVM^hmm     70.16   43.96   54.05    74.77   50.90   60.57
          MTL^hmm     80.64   47.94   60.13    85.06   55.34   67.06

Table 8. The named entity recognition average performance of different
algorithms on Spanish and Dutch datasets, respectively.

Figure 7. The learned template weights by MTL^hmm on Spanish and Dutch,
respectively, and the solid line for the average weight.



Figure 8. An example dependency tree (adapted from [18]).



5. Multiple Template Learning for Dependency Parsing 

In this section, we apply the proposed multiple template learning framework to
the dependency parsing task, where the output is a directed tree. Before using
Algorithm 1 to solve the dependency parsing problem, we first formulate it as a
multiple template learning problem by defining the compatibility function over
the tree structure and the feature groups obtained from a set of predefined
templates. Finally, we give feasible inference methods for finding the most
violated constraints, as well as for predicting on unseen inputs.

5.1. Dependency Parsing. The goal of dependency parsing is to find a generic
dependency tree $\mathcal{T}_{\mathbf{x}}$ for a given generic input sentence
$\mathbf{x} = (x_1, \dots, x_l)$ with $l$ tokens. One example is shown in Figure
8 with the input sentence "John hit the ball with the bat". An additional input
node called "root" is manually introduced as the root node of the dependency
tree. The input sentence can then be represented, augmented with the root node,
as $\mathbf{x} = (x_0, x_1, \dots, x_l)$ where $x_0 = \text{root}$. The
dependency tree is a tree structure with directed edges over the input nodes.

[18] showed that dependency parsing can be formulated as the search for a
maximum spanning tree in a directed graph. The generic dependency tree can be
represented as a directed graph
$\mathcal{T}_{\mathbf{x}} = (\mathcal{V}_{\mathbf{x}}, \mathcal{E}_{\mathbf{x}})$
with vertex set $\mathcal{V}_{\mathbf{x}} = \{x_0, x_1, \dots, x_l\}$ and
directed edge set
$\mathcal{E}_{\mathbf{x}} \subseteq \{(u,v) : u \neq v, (u,v) \in [0:l] \times [1:l]\}$,
under the constraints that $\mathcal{T}_{\mathbf{x}}$ is weakly connected and
all the nodes in $\mathcal{V}_{\mathbf{x}}$ have an in-degree of exactly one,
except the unique root node with in-degree zero. These constraints make the
dependency graph a dependency tree. There exist two types of dependency trees.
One is the projective dependency tree (see Figure 8), in which a word and its
descendants form a contiguous substring of the sentence. Equivalently, if the
sentence is written in its linear order with the root as the first word, the
edges drawn above the words have no crossings. If there is at least one
crossing in a tree, it is called a non-projective dependency tree.
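
The two structural conditions above can be checked directly on a head array, as in the following sketch. It assumes a tree is stored as a list `heads` in which `heads[v]` is the parent index of token v and index 0 is the artificial root (with `heads[0] = None`); this representation, the function names, and the head assignment used for the example sentence are illustrative assumptions.

```python
def is_dependency_tree(heads):
    """Check in-degree one for every token and reachability of the root."""
    n = len(heads)
    if n == 0 or heads[0] is not None:
        return False
    for v in range(1, n):
        u = heads[v]
        if u is None or not (0 <= u < n) or u == v:
            return False
    # Every token must reach the root without revisiting a node (no cycles).
    for v in range(1, n):
        seen, node = set(), v
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def is_projective(heads):
    """With the root as first word, a tree is projective iff no two arcs cross."""
    arcs = [(min(u, v), max(u, v)) for v, u in enumerate(heads) if u is not None]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:  # the two arcs cross
                return False
    return True


# root(0) John(1) hit(2) the(3) ball(4) with(5) the(6) bat(7)
projective_heads = [None, 2, 0, 4, 2, 2, 7, 5]
crossing_heads = [None, 3, 0, 2, 1]   # arcs (0,2) and (1,3) cross
cyclic_heads = [None, 2, 1]           # tokens 1 and 2 point at each other
```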

The goal of dependency parsing is to learn the hypothesis

(23) $f : \mathcal{X} \to \mathcal{T}$,

where $\mathcal{X}$ is the domain of generic input sentences, and $\mathcal{T}$
is the set of feasible dependency trees over $\mathcal{X}$. The score function
$s : \mathcal{X} \times \mathcal{T} \to \mathbb{R}$ is defined to be the
compatibility function, so that we can obtain the hypothesis by maximizing the
score over all the feasible trees of the input sentence $\mathbf{x}$,

(24) $f(\mathbf{x}) = \arg\max_{\mathcal{T}_{\mathbf{x}} \in \mathcal{T}} s(\mathbf{x}, \mathcal{T}_{\mathbf{x}})$.

The score functions are usually parameterized functions over a set of features
extracted from the training dataset
$\mathcal{D} = \{(\mathbf{x}_i, \mathcal{T}_{\mathbf{x}_i})\}_{i=1}^{n}$. Next,
we explain the detailed definition of the score function on tree structures and
the corresponding feature representation.



5.2. Feature Representation. By factoring the score of the tree
$\mathcal{T}_{\mathbf{x}}$ into the sum of edge scores [8], dependency parsing
becomes equivalent to finding maximum spanning trees. Given an edge
$(u, v) \in \mathcal{E}_{\mathbf{x}}$ in the dependency tree
$\mathcal{T}_{\mathbf{x}}$, the score of this edge can be defined as the inner
product between a high-dimensional feature representation
$\phi(u, v, \mathbf{x})$ of the edge and a weight vector $\mathbf{w}$, that is,
$s(u, v, \mathbf{x}; \mathbf{w}) = \langle \mathbf{w}, \phi(u, v, \mathbf{x}) \rangle$.
Correspondingly, the score of the dependency tree $\mathcal{T}_{\mathbf{x}}$ is

(25) $s(\mathcal{T}_{\mathbf{x}}, \mathbf{x}; \mathbf{w}) = \sum_{(u,v) \in \mathcal{E}_{\mathbf{x}}} s(u, v, \mathbf{x}; \mathbf{w}) = \sum_{(u,v) \in \mathcal{E}_{\mathbf{x}}} \langle \mathbf{w}, \phi(u, v, \mathbf{x}) \rangle$.

Assuming that the feature representation $\phi$ and the weight vector
$\mathbf{w}$ are given, dependency parsing amounts to solving Problem (24).

Since the feature representation $\phi(u, v, \mathbf{x})$ is defined on the
positions $u, v$ and the input sentence $\mathbf{x}$, we can define templates
that extract overlapping features over the whole observed sentence, together
with the direction of attachment and the distance between $x_u$ and $x_v$, the
two words creating the dependency. The dimension of $\phi$ is usually very
high, but the vector is very sparse, so sparse vector representations reduce
the computation to linear in the number of active features of a given edge.

For intuition, one template can be defined as "the words in the u and v
positions". Given the sentence "root John hit the ball with the bat", one of
the features obtained by applying this template is

$$\phi(u = 2, v = 4) = \begin{cases} 1 & \text{if } x_u = \text{"hit" and } x_v = \text{"ball"} \\ 0 & \text{otherwise.} \end{cases}$$


Generally, more templates are needed to obtain high accuracy for dependency
parsing. For instance, features that take the form of a POS trigram, for all
words linearly between the parent and the child, are particularly helpful for
nouns identifying their parents [17]. Similar to sequence learning problems, we
can group together the features extracted from each template to form a group of
features. Supposing that there are $m$ templates, we apply the $j$th template to
each sentence-tree pair $(\mathbf{x}_i, \mathcal{T}_{\mathbf{x}_i})$ to construct
a feature vector
$\Phi_j(\mathbf{x}_i, \mathcal{T}_{\mathbf{x}_i}) = \sum_{(u,v) \in \mathcal{E}_{\mathbf{x}_i}} \phi_j(u, v, \mathbf{x}_i)$,
where $\phi_j(u, v, \mathbf{x}_i)$ is the vector representation of all the
features in the $j$th group after applying the $j$th template to $\mathcal{D}$.
Therefore, we obtain a group representation of the features as
$\Phi(\mathbf{x}_i, \mathcal{T}_{\mathbf{x}_i}) = [\Phi_1(\mathbf{x}_i, \mathcal{T}_{\mathbf{x}_i}); \dots; \Phi_m(\mathbf{x}_i, \mathcal{T}_{\mathbf{x}_i})]$.
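
The grouped, edge-factored representation can be sketched with sparse counters as follows. Each template is modeled as a function that maps an edge to a list of feature names, and each group has its own weight dictionary; the two toy templates, the weights and all names are assumptions made for illustration, not the templates used in the experiments.

```python
from collections import Counter


def word_pair(u, v, sentence):
    """Toy template: the words at the head and modifier positions."""
    return [f"{sentence[u]}->{sentence[v]}"]


def direction(u, v, sentence):
    """Toy template: the direction of attachment."""
    return ["right" if u < v else "left"]


def group_features(templates, sentence, edges):
    """Return one sparse vector per template, summed over the edges of the tree."""
    groups = [Counter() for _ in templates]
    for u, v in edges:
        for j, template in enumerate(templates):
            groups[j].update(template(u, v, sentence))
    return groups


def tree_score(weights, groups):
    """Sum of per-group inner products between weights and group features."""
    return sum(w.get(name, 0.0) * count
               for w, phi in zip(weights, groups)
               for name, count in phi.items())


sentence = ["root", "John", "hit", "the", "ball"]
edges = [(0, 2), (2, 1), (2, 4), (4, 3)]  # (head, modifier) pairs
templates = [word_pair, direction]
groups = group_features(templates, sentence, edges)
weights = [{"hit->ball": 1.5, "root->hit": 0.5}, {"right": 0.25, "left": -0.25}]
score = tree_score(weights, groups)
```

Only the features that are active on some edge are ever touched, which is the sparsity argument made in Section 5.2.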

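5.3. Inference. Given the model parameter
$\mathbf{w} = [\mathbf{w}_1; \dots; \mathbf{w}_m]$, we can predict the
dependency tree $\hat{\mathcal{T}}_{\mathbf{x}}$ of an input sentence
$\mathbf{x}$ by solving Problem (24). According to the features defined in
Section 5.2, the prediction problem can be rewritten as

(26)
$$\begin{aligned}
\hat{\mathcal{T}}_{\mathbf{x}} &= \arg\max_{\mathcal{T}_{\mathbf{x}} \in \mathcal{T}} \langle \mathbf{w}, \Phi(\mathbf{x}, \mathcal{T}_{\mathbf{x}}) \rangle \\
&= \arg\max_{\mathcal{T}_{\mathbf{x}} \in \mathcal{T}} \sum_{j=1}^{m} \langle \mathbf{w}_j, \Phi_j(\mathbf{x}, \mathcal{T}_{\mathbf{x}}) \rangle \\
&= \arg\max_{\mathcal{T}_{\mathbf{x}} \in \mathcal{T}} \sum_{j=1}^{m} \sum_{(u,v) \in \mathcal{E}_{\mathbf{x}}} \langle \mathbf{w}_j, \phi_j(u, v, \mathbf{x}) \rangle.
\end{aligned}$$

According to [18], there are two algorithms that solve Problem (26) for the two
types of dependency trees, respectively. For projective dependency trees, the
Eisner algorithm [8] has a runtime of $O(l^3)$, while Chu-Liu-Edmonds [6]
provides non-projective parsing with complexity $O(l^2)$, where $l$ is the
length of the input sentence.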

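The most violated constraint is the constraint with the dependency tree
$\hat{\mathcal{T}}_{\mathbf{x}_i}$ such that, $\forall i = 1, \dots, n$,

(27)
$$\begin{aligned}
\hat{\mathcal{T}}_{\mathbf{x}_i} &= \arg\max_{\bar{\mathcal{T}} \in \mathcal{T}} \Delta(\mathcal{T}_{\mathbf{x}_i}, \bar{\mathcal{T}}) - \mathbf{w}^T \delta\Phi_i(\bar{\mathcal{T}}) \\
&= \arg\max_{\bar{\mathcal{T}} \in \mathcal{T}} \Delta(\mathcal{T}_{\mathbf{x}_i}, \bar{\mathcal{T}}) + \langle \mathbf{w}, \Phi(\mathbf{x}_i, \bar{\mathcal{T}}) \rangle.
\end{aligned}$$

The loss for dependency parsing is defined as the number of words that have an
incorrect parent, i.e.
$\Delta(\mathcal{T}_{\mathbf{x}_i}, \bar{\mathcal{T}}) = \sum_{(u,v) \in \bar{\mathcal{E}}} \mathbb{1}[(u,v) \notin \mathcal{E}_{\mathbf{x}_i}]$,
where $\bar{\mathcal{E}}$ is the edge set of $\bar{\mathcal{T}}$, $x_u$ is the
parent node, and $x_v$ is the child node. The inference algorithms above can be
employed to solve Problem (27) as well. This is due to the decomposition
property of the loss function over edges, which is closely related to the
Hamming loss used in sequence learning.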
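
The decomposition property can be illustrated with a small sketch: because the loss counts tokens with a wrong parent, adding 1 to the score of every edge whose head disagrees with the gold head yields edge scores whose tree totals equal model score plus loss, so the same maximum spanning tree routine can be reused. The head-array representation and all names below are assumptions of this example.

```python
def attachment_loss(gold_heads, predicted_heads):
    """Number of tokens (root excluded) whose predicted parent differs from gold."""
    return sum(1 for v in range(1, len(gold_heads))
               if gold_heads[v] != predicted_heads[v])


def loss_augmented_scores(edge_scores, gold_heads):
    """Add 1 to the score of every edge (u, v) whose head u is not the gold head of v."""
    return {(u, v): s + (0.0 if gold_heads[v] == u else 1.0)
            for (u, v), s in edge_scores.items()}


def tree_total(edge_scores, heads):
    """Sum of the edge scores of the tree encoded by `heads`."""
    return sum(edge_scores[(heads[v], v)] for v in range(1, len(heads)))


gold = [None, 2, 0, 2]        # heads[v] is the parent of token v; index 0 is root
candidate = [None, 0, 1, 2]   # a competing tree with two wrong parents
edge_scores = {(u, v): 0.1 * u + 0.01 * v
               for u in range(4) for v in range(1, 4) if u != v}
augmented = loss_augmented_scores(edge_scores, gold)
lhs = tree_total(augmented, candidate)
rhs = tree_total(edge_scores, candidate) + attachment_loss(gold, candidate)
```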

5.4. Experiments. We evaluate MTL for dependency parsing by comparing with the
MSTParser software (see footnote 5), which implements an online algorithm for
dependency parsing [17]. We adapt MSTParser to multiple template learning in
Java, yielding MTLParser, for projective dependency parsing. The comparison
mainly examines whether multiple template learning can boost the performance by
learning the weight of each template simultaneously. We perform comparisons on
the CoNLL-X shared task: Multi-lingual Dependency Parsing. This shared task
consists of datasets from 13 different languages. In our experiments, we choose
two datasets, Danish and Swedish, which are available from the website (see
footnote 6). For each language, the dataset consists of two data files. One is
the training dataset, while



5 http://sourceforge.net/projects/mstparser/
6 http://nextens.uvt.nl/~conll/post_task_data.html

                                   MSTParser(%)          MTLParser(%)
Dataset   #Features   #Templates   Accuracy  Complete    Accuracy  Complete
Danish    2,137,252   568          86.79     30.12       86.86     32.61
Swedish   2,292,216   612          83.59     38.56       83.95     40.62

Table 9. The performance of dependency parsing on two datasets. #Features is
the number of features used in both MSTParser and MTLParser. #Templates is the
number of templates, which naturally form groups over all features in
MTLParser. Accuracy is the percentage of words that correctly identified their
parents in the tree. Complete is the percentage of sentences for which the
entire dependency tree is correct.



the other is the ground-truth test dataset. The training dataset is used to
train the prediction model, and the testing results are then reported on the
ground-truth test dataset. In each dataset, sentences are separated by a blank
line. A sentence consists of at least one token, and each token consists of ten
fields, of which we use only the following: token counter, starting from 1 for
each new sentence (ID), word or punctuation symbol (FORM), lemma of the word
form (LEMMA), coarse-grained POS tag (CPOSTAG), fine-grained POS tag (POSTAG),
and the head of the dependency relation (HEAD).
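
A minimal reader for this file layout might look as follows. It assumes tab-separated fields in the CoNLL-X column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and keeps only the six fields listed above; the function name and the toy sample are made up for illustration.

```python
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]
USED = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG", "HEAD"]


def read_conllx(text):
    """Parse CoNLL-X formatted text into a list of sentences (lists of tokens)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        values = dict(zip(FIELDS, line.split("\t")))
        token = {name: values[name] for name in USED}
        token["ID"], token["HEAD"] = int(token["ID"]), int(token["HEAD"])
        current.append(token)
    if current:                           # file may not end with a blank line
        sentences.append(current)
    return sentences


# Toy two-sentence sample (not taken from the Danish or Swedish data).
sample = (
    "1\tJohn\tjohn\tN\tNNP\t_\t2\tSBJ\t_\t_\n"
    "2\tsleeps\tsleep\tV\tVBZ\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tHi\thi\tI\tUH\t_\t0\tROOT\t_\t_\n"
)
sentences = read_conllx(sample)
```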

We use the templates implemented in MSTParser, namely: (1) POS (including
CPOSTAG and POSTAG) trigrams: the POS of the head, that of the modifier and
that of a word in between, for all distinct POS tags of the words between the
head and the modifier. Each relative position from the head to the modifier can
be considered as a different type of template. (2) POS 4-grams: the POS of the
head, the modifier, the word before/after the head and the word before/after
the modifier. (3) Two items: each template consists of two observations, e.g.
head word, head POS/LEMMA, child word, and child POS/LEMMA. All templates are
conjoined with the direction of attachment as well as the distance between the
two words creating the dependency. If the distance is longer than 10, it is set
to 10; if it is between 5 and 10, it is set to 5; otherwise it is left
unchanged.
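
The distance binning rule reads as follows in code. The treatment of the boundary values is an assumption of this sketch (a distance of exactly 10 falls into the 5 bin, and a distance of exactly 5 is unchanged), as are the function names and the "&"-separated feature string format.

```python
def bin_distance(head_index, modifier_index):
    """Bucket the head-modifier distance: >10 -> 10, 6..10 -> 5, else unchanged."""
    distance = abs(head_index - modifier_index)
    if distance > 10:
        return 10
    if distance > 5:
        return 5
    return distance


def attach_context(feature, head_index, modifier_index):
    """Conjoin a feature string with attachment direction and binned distance."""
    direction = "R" if head_index < modifier_index else "L"
    return f"{feature}&{direction}&{bin_distance(head_index, modifier_index)}"


example = attach_context("hit/ball", 2, 4)
```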

Table 9 shows the results of MSTParser and MTLParser on the two
language-independent dependency parsing datasets. We run MSTParser with the
number of epochs ranging from 10 to 30 and report the best results, while the
parameter C in MTLParser is set to 1000n. We observe that MTLParser obtains
results comparable to MSTParser in terms of accuracy, but improves the
percentage of completely correct dependency trees by more than 2%. This is due
to exploiting the importance of each template. Figure 9 shows the learned
weights on the Danish dataset. Several templates are highly weighted, while
most of the rest approach zero. This interpretability can be beneficial for
understanding the syntax of natural languages. Given the number of groups used
in this task, we conclude that our MTL framework can handle hundreds of
templates over a large number of instances and high-dimensional data. This is
mainly attributed to the proposed MKL method, which trains in the primal by a
cutting plane algorithm.






[Bar chart: x-axis, index of template; bars for the learned weight; solid line,
average weight.]

Figure 9. The weight of each template learned by MTLParser on Danish dataset,
and the red solid line for the average weight.



6. Conclusion 

Structured prediction is an important modeling strategy for various real-world
applications. Instead of modeling the structure of a specific application, in
this paper we explore the underlying feature structure over the input-output
pairs, which contributes to the success of discriminative models such as CRFs
and Structural SVMs. Structured prediction algorithms devote most of their
effort to modeling the interdependence among the output variables, and pay less
attention to feature engineering, which is a non-trivial task for general
users. To alleviate this issue, we propose a Multiple Template Learning (MTL)
paradigm that learns the weight of each template and the structured prediction
model simultaneously. Given a set of predefined templates, the features
extracted from each individual template naturally form a group. Learning the
weights of these groups is formulated as a Multiple Kernel Learning (MKL)
problem, which we solve in the primal by an efficient cutting plane algorithm.
The MTL framework inherits from Structural SVMs and can easily be instantiated
for specific applications. Two special cases are explored in our proposed MTL
framework, namely sequence labeling and dependency parsing. Extensive
experimental results demonstrate that learning a structured prediction model
with template weighting automatically reveals the importance of each template,
so users can define templates without caution. This helps to prevent
degradation of the prediction model when poor or even conflicting templates are
added by users. In addition, it can boost the prediction performance by
weighting the features among different groups.



Appendix A. The Conic Dual of Problem (9)

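Without loss of generality, we denote $\mathbf{w} = [\mathbf{w}_1; \ldots; \mathbf{w}_m]$
and $\mathbf{p}^r = [\mathbf{p}^r_1; \ldots; \mathbf{p}^r_m]$. Problem (9) becomes

$$
\begin{aligned}
\min_{\mathbf{w},\, \xi \ge 0} \quad & \frac{1}{2} \Big( \sum_{j=1}^{m} \|\mathbf{w}_j\| \Big)^2 + C\xi \\
\text{s.t.} \quad & \xi \ge q^r + \mathbf{w}^T \mathbf{p}^r, \quad \forall r = 1, \ldots, s.
\end{aligned}
$$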

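By introducing a new variable $u \in \mathbb{R}$ and moving the summation from the
objective into a constraint, we obtain the equivalent optimization problem

$$
\begin{aligned}
\min_{\mathbf{w},\, u,\, \xi \ge 0} \quad & \frac{1}{2} u^2 + C\xi \\
\text{s.t.} \quad & \xi \ge q^r + \mathbf{w}^T \mathbf{p}^r, \quad \forall r = 1, \ldots, s \\
& \sum_{j=1}^{m} \|\mathbf{w}_j\| \le u.
\end{aligned}
$$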

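We can further simplify the above problem by introducing additional variables
$\boldsymbol{\rho} \in \mathbb{R}^m$ such that $\|\mathbf{w}_j\| \le \rho_j$,
$\forall j = 1, \ldots, m$, which gives

$$
\begin{aligned}
\min_{\mathbf{w},\, u,\, \boldsymbol{\rho},\, \xi \ge 0} \quad & \frac{1}{2} u^2 + C\xi \\
\text{s.t.} \quad & \xi \ge q^r + \mathbf{w}^T \mathbf{p}^r, \quad \forall r = 1, \ldots, s \\
& \sum_{j=1}^{m} \rho_j \le u, \quad \|\mathbf{w}_j\| \le \rho_j, \quad \forall j = 1, \ldots, m.
\end{aligned}
$$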
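
The equivalence of these formulations can be checked numerically for a fixed w: choosing the tightest auxiliary variables (each bound on a group norm set to that norm, and u set to their sum) recovers the squared sum of group norms. The sketch below is illustrative only; the function names and the nested-list data layout are assumptions.

```python
import math
from typing import List, Sequence

Groups = List[Sequence[float]]  # one vector per template group


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def slack(w: Groups, p: List[Groups], q: Sequence[float]) -> float:
    """Smallest feasible xi: max(0, max_r q^r + w^T p^r)."""
    values = [q_r + sum(dot(w_j, p_j) for w_j, p_j in zip(w, p_r))
              for q_r, p_r in zip(q, p)]
    return max(0.0, max(values))


def group_objective(w: Groups, p: List[Groups], q: Sequence[float], C: float) -> float:
    """0.5 * (sum_j ||w_j||)^2 + C * xi, with the sum kept in the objective."""
    return 0.5 * sum(norm(w_j) for w_j in w) ** 2 + C * slack(w, p, q)


def lifted_objective(w: Groups, p: List[Groups], q: Sequence[float], C: float) -> float:
    """0.5 * u^2 + C * xi, using the auxiliary variables rho_j and u."""
    rho = [norm(w_j) for w_j in w]  # tightest rho_j with ||w_j|| <= rho_j
    u = sum(rho)                    # tightest u with sum_j rho_j <= u
    return 0.5 * u ** 2 + C * slack(w, p, q)


w = [[3.0, 4.0], [0.0], [1.0]]           # three template groups
p = [[[1.0, 0.0], [2.0], [-1.0]],        # p^1, split by group
     [[-1.0, 0.0], [0.0], [0.0]]]        # p^2, split by group
q = [0.5, 1.0]
print(group_objective(w, p, q, C=2.0), lifted_objective(w, p, q, C=2.0))
```

Because the objective is increasing in u for non-negative u, any looser choice of the auxiliary variables can only increase the value, so the minimum of the lifted problem coincides with that of the original one.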



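We know that $\|\mathbf{w}_j\| \le \rho_j$ is a second-order cone constraint. Following
the recipe of [5], the self-dual cone $\|\mathbf{v}_j\|_2 \le \eta_j$,
$\forall j = 1, \ldots, m$, can be introduced to form the Lagrangian function

$$
\begin{aligned}
\mathcal{L}(\mathbf{w}, \xi, u, \boldsymbol{\rho}; \boldsymbol{\alpha}, \tau, \gamma, \mathbf{v}, \boldsymbol{\eta})
= {} & \frac{1}{2} u^2 + C\xi - \sum_{r=1}^{s} \alpha_r \big( \xi - q^r - \mathbf{w}^T \mathbf{p}^r \big) - \tau\xi \\
& + \gamma \Big( \sum_{j=1}^{m} \rho_j - u \Big) - \sum_{j=1}^{m} \big( \mathbf{v}_j^T \mathbf{w}_j + \eta_j \rho_j \big)
\end{aligned}
$$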

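with dual variables $\alpha_r \in \mathbb{R}_+$, $\tau \in \mathbb{R}_+$, and
$\gamma \in \mathbb{R}_+$. The derivatives of the Lagrangian with respect to the
primal variables have to vanish, which leads to the following KKT conditions:

$$
\begin{aligned}
(28) \quad & \mathbf{v}_j = \sum_{r=1}^{s} \alpha_r \mathbf{p}^r_j, \quad \forall j = 1, \ldots, m \\
(29) \quad & C - \sum_{r=1}^{s} \alpha_r - \tau = 0 \\
(30) \quad & u = \gamma \\
(31) \quad & \gamma = \eta_j, \quad \forall j = 1, \ldots, m
\end{aligned}
$$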

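By substituting the dual variables for all the primal variables via the above KKT
conditions, we obtain the following dual problem:

$$
\begin{aligned}
\max_{\boldsymbol{\alpha},\, \gamma} \quad & -\frac{1}{2} \gamma^2 + \sum_{r=1}^{s} \alpha_r q^r \\
\text{s.t.} \quad & \Big\| \sum_{r=1}^{s} \alpha_r \mathbf{p}^r_j \Big\| \le \gamma, \quad \forall j = 1, \ldots, m \\
& \sum_{r=1}^{s} \alpha_r \le C, \quad \alpha_r \ge 0, \quad \forall r = 1, \ldots, s.
\end{aligned}
$$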
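
The substitution step can be spelled out as follows. This is a sketch that assumes the conic term enters the Lagrangian as $-\sum_j (\mathbf{v}_j^T \mathbf{w}_j + \eta_j \rho_j)$, the form consistent with conditions (28) and (31); all sums run over $r = 1, \ldots, s$ and $j = 1, \ldots, m$.

```latex
\begin{align*}
% Group the Lagrangian by primal variable.
\mathcal{L} &= \tfrac{1}{2}u^2 - \gamma u
   + \Big(C - \sum_{r} \alpha_r - \tau\Big)\xi
   + \sum_{j} \Big(\sum_{r} \alpha_r \mathbf{p}^r_j - \mathbf{v}_j\Big)^T \mathbf{w}_j
   + \sum_{j} (\gamma - \eta_j)\,\rho_j
   + \sum_{r} \alpha_r q^r \\
% (29), (28) and (31) remove the terms in \xi, w_j and rho_j; (30) sets u = gamma.
 &= \tfrac{1}{2}\gamma^2 - \gamma^2 + \sum_{r} \alpha_r q^r
  = -\tfrac{1}{2}\gamma^2 + \sum_{r} \alpha_r q^r \\
% Cone membership of (v_j, eta_j) with (28) and (31) gives the norm constraint.
 & \|\mathbf{v}_j\| \le \eta_j
   \;\Longrightarrow\;
   \Big\|\sum_{r} \alpha_r \mathbf{p}^r_j\Big\| \le \gamma \\
% Non-negativity of tau with (29) gives the bound on the sum of alpha_r.
 & \tau \ge 0
   \;\Longrightarrow\;
   \sum_{r} \alpha_r = C - \tau \le C
\end{align*}
```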

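By setting $\theta = \frac{1}{2}\gamma^2$ and
$\mathcal{A}_s = \{\boldsymbol{\alpha} : \sum_{r=1}^{s} \alpha_r \le C,\ \alpha_r \ge 0,\ \forall r = 1, \ldots, s\}$,
we can reformulate the above problem as

$$
\begin{aligned}
\max_{\theta,\, \boldsymbol{\alpha} \in \mathcal{A}_s} \quad & -\theta + \sum_{r=1}^{s} \alpha_r q^r \\
\text{s.t.} \quad & \frac{1}{2} \boldsymbol{\alpha}^T Q^j \boldsymbol{\alpha} \le \theta, \quad \forall j = 1, \ldots, m
\end{aligned}
$$

where $Q^j_{r,r'} = \langle \mathbf{p}^r_j, \mathbf{p}^{r'}_j \rangle$. According to the
property of the self-dual cone, we can obtain the primal solution from its dual as
$\mathbf{w}_j = \mu_j \mathbf{v}_j = \mu_j \sum_{r=1}^{s} \alpha_r \mathbf{p}^r_j$, where
$\mu_j$ is the dual variable of the $j$th quadratic constraint such that
$\sum_{j=1}^{m} \mu_j = 1$, $\mu_j \in \mathbb{R}_+$, $\forall j = 1, \ldots, m$.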
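
As a small numerical illustration of the reformulated dual, the sketch below builds the per-template Gram matrices from the group blocks of the cutting planes and evaluates the dual objective for a given feasible vector of multipliers, using the smallest value of the bound variable that satisfies all m quadratic constraints. The function names and the nested-list layout (`p_by_group[j][r]` holds the block of cutting plane r for template j) are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def gram_matrix(p_j: List[Sequence[float]]) -> List[List[float]]:
    """Q^j with entries <p_j^r, p_j^r'> for one template group."""
    return [[dot(a, b) for b in p_j] for a in p_j]


def quad_form(alpha: Sequence[float], Q: List[List[float]]) -> float:
    """alpha^T Q alpha."""
    n = len(alpha)
    return sum(alpha[r] * Q[r][t] * alpha[t] for r in range(n) for t in range(n))


def is_feasible(alpha: Sequence[float], C: float) -> bool:
    """Membership in A_s: non-negative entries whose sum is at most C."""
    return all(a >= 0.0 for a in alpha) and sum(alpha) <= C


def dual_objective(alpha: Sequence[float], q: Sequence[float],
                   p_by_group: List[List[Sequence[float]]]) -> Tuple[float, float]:
    """Return (-theta + sum_r alpha_r q^r, theta) with the smallest feasible theta."""
    theta = max(0.5 * quad_form(alpha, gram_matrix(p_j)) for p_j in p_by_group)
    return -theta + dot(alpha, q), theta


p_by_group = [
    [[1.0, 0.0], [0.0, 2.0]],  # template 1: blocks of cutting planes 1 and 2
    [[1.0], [-1.0]],           # template 2: blocks of cutting planes 1 and 2
]
alpha = [0.5, 0.5]
q = [1.0, 2.0]
assert is_feasible(alpha, C=1.0)
value, theta = dual_objective(alpha, q, p_by_group)
```

Only the templates whose quadratic constraint is tight at the optimum determine the bound, which matches the observation that a few templates receive most of the weight.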